Background and Motivation

As someone considering a career in Data Science and Machine Learning, I have browsed online job ads for data scientists and was overwhelmed by the number of skills required. Learning the whole repertoire of data science tools would take a very long time.

To help decide which tools to learn first, I looked at which data science skills are currently most in demand by employers in major cities in Canada and the U.S.

Just like any sane person would do these days, I googled “web scraping indeed” to see if I could scrape its job ads and analyse them. I came across this awesome blog post by Jesse Steinweg-Woods, published in March 2015. He looked at the data science tools mentioned in Indeed job ads for Seattle, Chicago, San Francisco, New York and nationwide.

His nationwide search suggested that Python is in much higher demand than R. This really intrigued me: are the trends he found a year ago still the same?

Goals

My initial goals were:

1. To examine the Data Science skills that most employers are looking for in Canada and the U.S.
2. To examine the ranking of Canadian and American cities based on opportunities for Data Scientists.

However, as I searched for data scientist job ads in Canada, I came to realize that most cities didn’t have enough ads. So I decided to focus on the national trend for Canada.

As for the U.S., I initially wanted to look at the 100 largest cities. During data collection, however, some of these cities had few or no ads. So I focused instead on the following 53 cities, which had enough job ads for data scientists.

While gathering data from Data Scientist job ads, I started to wonder whether the skills employers look for would differ for other data-driven jobs, specifically Data Analysts, Data Architects and Data Engineers. So I decided to collect data for those roles as well.

Data collection

Source of data

I used indeed.com and indeed.ca to gather job ads.

Web scraping

I wanted to use rvest to scrape data from Indeed. However, after a few days of trying, I couldn’t get the scraping function to work: Indeed is a job search aggregator, and the links it provides go to different company websites with different structures. This was my first time scraping, so I probably didn’t know enough R tools to make it work.

So I decided to see if I could modify Jesse’s IPython code to produce .csv files of the word counts of Data Science tool names from the job ads. His code was written to plot the results right after scraping, not to produce .csv files.

You can find the IPython notebooks here.

You may notice that the code in the notebooks is a bit repetitive. I tried to stick to the DRY (don’t repeat yourself) principle, but I had never programmed in Python before, so making the code neater remains a work in progress. In the meantime, this code was enough to collect the data.

Data was collected for job ads that contained exactly “Data Analyst”, “Data Scientist”, “Data Architect” or “Data Engineer” in their text. Since the search was done for each job title in each of the 53 cities mentioned above, I ended up with 212 (53 × 4) .csv files, and it took quite a bit of wrangling to combine them.

Out of the 53 cities, 14 had fewer than 10 (or no) job postings for data engineers. Therefore, I used 39 cities to compare the skills among “Data Analysts”, “Data Scientists”, “Data Architects” and “Data Engineers”, and data from all 53 cities to compare the skills among “Data Analysts”, “Data Scientists” and “Data Architects”.

The data collected include the percentage of job ads that mention each of the analysis tools or programming languages listed below, and the number of job postings per job title in each city.

  • R
  • Python
  • Java
  • C++
  • Ruby
  • Perl
  • Matlab
  • Javascript
  • Scala
  • Excel
  • Tableau
  • D3.js
  • SAS
  • SPSS
  • D3
  • Hadoop
  • MapReduce
  • Spark
  • Pig
  • Hive
  • Shark
  • Oozie
  • Zookeeper
  • Flume
  • Mahout
  • SQL
  • NoSQL
  • HBase
  • Cassandra
  • MongoDB

Data Wrangling

I had separate .csv files of word-count percentages (Data Science tools) for Data Analysts (DAnalyst), Data Scientists (DScientist), Data Architects (DArchitect) and Data Engineers (DEngineer) for each city. Below, I combine them to analyse the data.

library(readr)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Joining datasets:

#NewYork
NewYork1 <- read_csv("NewYorkSkills1.csv") #.csv for Data Analysts' job ad data
NewYork2 <- read_csv("NewYorkSkills2.csv") #.csv for Data Scientists' job ad data
NewYork3 <- read_csv("NewYorkSkills3.csv") #.csv for Data Architects' job ad data
NewYork4 <- read_csv("NewYorkSkills4.csv") #.csv for Data Engineers' job ad data

names(NewYork1) <- c("Term", "DAnalyst")
names(NewYork2) <- c("Term", "DScientist")
names(NewYork3) <- c("Term", "DArchitect")
names(NewYork4) <- c("Term", "DEngineer")

j1 <- full_join(NewYork1, NewYork2, by="Term") 
j2 <- full_join(NewYork3, NewYork4, by="Term")
NewYork <- full_join(j1, j2, by="Term")

NewYork$City <- "New York"
NewYork$State <- "NY"
NewYork <- replace_na(NewYork, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
#NAs mean the term didn't appear in the job postings, so they are replaced with zero.

write_csv( NewYork, path = "NewYork.csv") 
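The same read-rename-join-write steps are repeated for every city below. As a sketch of how this repetition could be reduced (in keeping with the DRY principle mentioned earlier), the per-city logic could be wrapped in a helper function. `combine_city` and its arguments are hypothetical names, and the sketch assumes the `CitySkills1.csv` … `CitySkills4.csv` naming convention used here:

```r
library(readr)
library(tidyr)
library(dplyr)

# Hypothetical helper: read the per-role skill files for one city,
# join them by Term, fill missing terms with zero, and write the result.
combine_city <- function(city, state, file_prefix = gsub(" ", "", city),
                         n_roles = 4) {
  roles <- c("DAnalyst", "DScientist", "DArchitect", "DEngineer")[1:n_roles]
  dfs <- lapply(seq_len(n_roles), function(i) {
    df <- read_csv(paste0(file_prefix, "Skills", i, ".csv"))
    names(df) <- c("Term", roles[i])
    df
  })
  combined <- Reduce(function(x, y) full_join(x, y, by = "Term"), dfs)
  combined$City <- city
  combined$State <- state
  # NAs mean the term never appeared, so replace them with zero
  combined <- replace_na(combined, setNames(as.list(rep(0, n_roles)), roles))
  write_csv(combined, path = paste0(file_prefix, ".csv"))
  combined
}

# e.g. combine_city("New York", "NY", file_prefix = "NewYork")
# or, for a city with no Data Engineer file: combine_city("Miami", "FL", n_roles = 3)
```

This depends on the skill files being present in the working directory, so it is a sketch rather than a drop-in replacement; the explicit per-city blocks below are what was actually run.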

#Anaheim
Anaheim1 <- read_csv("AnaheimSkills1.csv")
Anaheim2 <- read_csv("AnaheimSkills2.csv")
Anaheim3 <- read_csv("AnaheimSkills3.csv")
Anaheim4 <- read_csv("AnaheimSkills4.csv")

names(Anaheim1) <- c("Term", "DAnalyst")
names(Anaheim2) <- c("Term", "DScientist")
names(Anaheim3) <- c("Term", "DArchitect")
names(Anaheim4) <- c("Term", "DEngineer")

j1 <- full_join(Anaheim1, Anaheim2, by="Term") 
j2 <- full_join(Anaheim3, Anaheim4, by="Term")
Anaheim <- full_join(j1, j2, by="Term")

Anaheim$City <- "Anaheim"
Anaheim$State <- "CA"

Anaheim <- replace_na(Anaheim, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv( Anaheim, path = "Anaheim.csv") 

#Arlington
Arlington1 <- read_csv("ArlingtonSkills1.csv")
Arlington2 <- read_csv("ArlingtonSkills2.csv")
Arlington3 <- read_csv("ArlingtonSkills3.csv")
Arlington4 <- read_csv("ArlingtonSkills4.csv")

names(Arlington1) <- c("Term", "DAnalyst")
names(Arlington2) <- c("Term", "DScientist")
names(Arlington3) <- c("Term", "DArchitect")
names(Arlington4) <- c("Term", "DEngineer")

j1 <- full_join(Arlington1, Arlington2, by="Term") 
j2 <- full_join(Arlington3, Arlington4, by="Term")
Arlington <- full_join(j1, j2, by="Term")

Arlington$City <- "Arlington"
Arlington$State <- "TX"

Arlington <- replace_na(Arlington, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv( Arlington, path = "Arlington.csv") 

#Atlanta
Atlanta1 <- read_csv("AtlantaSkills1.csv")
Atlanta2 <- read_csv("AtlantaSkills2.csv")
Atlanta3 <- read_csv("AtlantaSkills3.csv")
Atlanta4 <- read_csv("AtlantaSkills4.csv")

names(Atlanta1) <- c("Term", "DAnalyst")
names(Atlanta2) <- c("Term", "DScientist")
names(Atlanta3) <- c("Term", "DArchitect")
names(Atlanta4) <- c("Term", "DEngineer")

j1 <- full_join(Atlanta1, Atlanta2, by="Term") 
j2 <- full_join(Atlanta3, Atlanta4, by="Term")
Atlanta <- full_join(j1, j2, by="Term")

Atlanta$City <- "Atlanta"
Atlanta$State <- "GA"

Atlanta <- replace_na(Atlanta, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv( Atlanta, path = "Atlanta.csv") 


#Aurora
Aurora1 <- read_csv("AuroraSkills1.csv")
Aurora2 <- read_csv("AuroraSkills2.csv")
Aurora3 <- read_csv("AuroraSkills3.csv")
Aurora4 <- read_csv("AuroraSkills4.csv")

names(Aurora1) <- c("Term", "DAnalyst")
names(Aurora2) <- c("Term", "DScientist")
names(Aurora3) <- c("Term", "DArchitect")
names(Aurora4) <- c("Term", "DEngineer")

j1 <- full_join(Aurora1, Aurora2, by="Term") 
j2 <- full_join(Aurora3, Aurora4, by="Term")
Aurora <- full_join(j1, j2, by="Term")

Aurora$City <- "Aurora"
Aurora$State <- "CO"

Aurora <- replace_na(Aurora, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv( Aurora, path = "Aurora.csv") 

#Austin
Austin1 <- read_csv("AustinSkills1.csv")
Austin2 <- read_csv("AustinSkills2.csv")
Austin3 <- read_csv("AustinSkills3.csv")
Austin4 <- read_csv("AustinSkills4.csv")

names(Austin1) <- c("Term", "DAnalyst")
names(Austin2) <- c("Term", "DScientist")
names(Austin3) <- c("Term", "DArchitect")
names(Austin4) <- c("Term", "DEngineer")

j1 <- full_join(Austin1, Austin2, by="Term") 
j2 <- full_join(Austin3, Austin4, by="Term")
Austin <- full_join(j1, j2, by="Term")

Austin$City <- "Austin"
Austin$State <- "TX"

Austin <- replace_na(Austin, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv( Austin, path = "Austin.csv") 

#Baltimore
Baltimore1 <- read_csv("BaltimoreSkills1.csv")
Baltimore2 <- read_csv("BaltimoreSkills2.csv")
Baltimore3 <- read_csv("BaltimoreSkills3.csv")
Baltimore4 <- read_csv("BaltimoreSkills4.csv")

names(Baltimore1) <- c("Term", "DAnalyst")
names(Baltimore2) <- c("Term", "DScientist")
names(Baltimore3) <- c("Term", "DArchitect")
names(Baltimore4) <- c("Term", "DEngineer")

j1 <- full_join(Baltimore1, Baltimore2, by="Term") 
j2 <- full_join(Baltimore3, Baltimore4, by="Term")
Baltimore <- full_join(j1, j2, by="Term")

Baltimore$City <- "Baltimore"
Baltimore$State <- "MD"

Baltimore <- replace_na(Baltimore, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Baltimore, path = "Baltimore.csv") 


#Boston
Boston1 <- read_csv("BostonSkills1.csv")
Boston2 <- read_csv("BostonSkills2.csv")
Boston3 <- read_csv("BostonSkills3.csv")
Boston4 <- read_csv("BostonSkills4.csv")

names(Boston1) <- c("Term", "DAnalyst")
names(Boston2) <- c("Term", "DScientist")
names(Boston3) <- c("Term", "DArchitect")
names(Boston4) <- c("Term", "DEngineer")

j1 <- full_join(Boston1, Boston2, by="Term") 
j2 <- full_join(Boston3, Boston4, by="Term")
Boston <- full_join(j1, j2, by="Term")

Boston$City <- "Boston"
Boston$State <- "MA"

Boston <- replace_na(Boston, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Boston, path = "Boston.csv") 

#Boulder
Boulder1 <- read_csv("BoulderSkills1.csv")
Boulder2 <- read_csv("BoulderSkills2.csv")
Boulder3 <- read_csv("BoulderSkills3.csv")
Boulder4 <- read_csv("BoulderSkills4.csv")

names(Boulder1) <- c("Term", "DAnalyst")
names(Boulder2) <- c("Term", "DScientist")
names(Boulder3) <- c("Term", "DArchitect")
names(Boulder4) <- c("Term", "DEngineer")

j1 <- full_join(Boulder1, Boulder2, by="Term") 
j2 <- full_join(Boulder3, Boulder4, by="Term")
Boulder <- full_join(j1, j2, by="Term")

Boulder$City <- "Boulder"
Boulder$State <- "CO"

Boulder <- replace_na(Boulder, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Boulder, path = "Boulder.csv") 

#Chandler
a <- read_csv("ChandlerSkills1.csv")
b <- read_csv("ChandlerSkills2.csv")
c <- read_csv("ChandlerSkills3.csv")
d <- read_csv("ChandlerSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Chandler <- full_join(j1, j2, by="Term")

Chandler$City <- "Chandler"
Chandler$State <- "AZ"

Chandler <- replace_na(Chandler, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Chandler, path = "Chandler.csv") 

#Charlotte
a <- read_csv("CharlotteSkills1.csv")
b <- read_csv("CharlotteSkills2.csv")
c <- read_csv("CharlotteSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Charlotte <- full_join(j1, c, by="Term")

Charlotte$City <- "Charlotte"
Charlotte$State <- "NC"

Charlotte <- replace_na(Charlotte, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Charlotte, path = "Charlotte.csv") 

#Chicago
Chicago1 <- read_csv("ChicagoSkills1.csv")
Chicago2 <- read_csv("ChicagoSkills2.csv")
Chicago3 <- read_csv("ChicagoSkills3.csv")
Chicago4 <- read_csv("ChicagoSkills4.csv")

names(Chicago1) <- c("Term", "DAnalyst")
names(Chicago2) <- c("Term", "DScientist")
names(Chicago3) <- c("Term", "DArchitect")
names(Chicago4) <- c("Term", "DEngineer")

j1 <- full_join(Chicago1, Chicago2, by="Term") 
j2 <- full_join(Chicago3, Chicago4, by="Term")
Chicago <- full_join(j1, j2, by="Term")

Chicago$City <- "Chicago"
Chicago$State <- "IL"

Chicago <- replace_na(Chicago, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Chicago, path = "Chicago.csv") 

#Cincinnati
a <- read_csv("CincinnatiSkills1.csv")
b <- read_csv("CincinnatiSkills2.csv")
c <- read_csv("CincinnatiSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Cincinnati <- full_join(j1, c, by="Term")

Cincinnati$City <- "Cincinnati"
Cincinnati$State <- "OH"

Cincinnati <- replace_na(Cincinnati, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Cincinnati, path = "Cincinnati.csv") 

#Columbus
a <- read_csv("ColumbusSkills1.csv")
b <- read_csv("ColumbusSkills2.csv")
c <- read_csv("ColumbusSkills3.csv")
d <- read_csv("ColumbusSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Columbus <- full_join(j1, j2, by="Term")

Columbus$City <- "Columbus"
Columbus$State <- "OH"

Columbus <- replace_na(Columbus, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Columbus, path = "Columbus.csv") 

#Dallas
a <- read_csv("DallasSkills1.csv")
b <- read_csv("DallasSkills2.csv")
c <- read_csv("DallasSkills3.csv")
d <- read_csv("DallasSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Dallas <- full_join(j1, j2, by="Term")

Dallas$City <- "Dallas"
Dallas$State <- "TX"

Dallas <- replace_na(Dallas, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Dallas, path = "Dallas.csv") 

#Denver
a <- read_csv("DenverSkills1.csv")
b <- read_csv("DenverSkills2.csv")
c <- read_csv("DenverSkills3.csv")
d <- read_csv("DenverSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Denver <- full_join(j1, j2, by="Term")

Denver$City <- "Denver"
Denver$State <- "CO"

Denver <- replace_na(Denver, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Denver, path = "Denver.csv") 

#Detroit
a <- read_csv("DetroitSkills1.csv")
b <- read_csv("DetroitSkills2.csv")
c <- read_csv("DetroitSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Detroit <- full_join(j1, c, by="Term")

Detroit$City <- "Detroit"
Detroit$State <- "MI"

Detroit <- replace_na(Detroit, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Detroit, path = "Detroit.csv") 

#Durham
a <- read_csv("DurhamSkills1.csv")
b <- read_csv("DurhamSkills2.csv")
c <- read_csv("DurhamSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Durham <- full_join(j1, c, by="Term")

Durham$City <- "Durham"
Durham$State <- "NC"

Durham <- replace_na(Durham, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Durham, path = "Durham.csv") 

#Fremont
a <- read_csv("FremontSkills1.csv")
b <- read_csv("FremontSkills2.csv")
c <- read_csv("FremontSkills3.csv")
d <- read_csv("FremontSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Fremont <- full_join(j1, j2, by="Term")

Fremont$City <- "Fremont"
Fremont$State <- "CA"

Fremont <- replace_na(Fremont, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Fremont, path = "Fremont.csv") 

#FortWorth
a <- read_csv("FortWorthSkills1.csv")
b <- read_csv("FortWorthSkills2.csv")
c <- read_csv("FortWorthSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
FortWorth <- full_join(j1, c, by="Term")

FortWorth$City <- "Fort Worth"
FortWorth$State <- "TX"

FortWorth <- replace_na(FortWorth, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(FortWorth, path = "FortWorth.csv") 

#Garland
a <- read_csv("GarlandSkills1.csv")
b <- read_csv("GarlandSkills2.csv")
c <- read_csv("GarlandSkills3.csv")
d <- read_csv("GarlandSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Garland <- full_join(j1, j2, by="Term")

Garland$City <- "Garland"
Garland$State <- "TX"

Garland <- replace_na(Garland, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Garland, path = "Garland.csv") 

#Gilbert
a <- read_csv("GilbertSkills1.csv")
b <- read_csv("GilbertSkills2.csv")
c <- read_csv("GilbertSkills3.csv")
d <- read_csv("GilbertSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Gilbert <- full_join(j1, j2, by="Term")

Gilbert$City <- "Gilbert"
Gilbert$State <- "AZ"

Gilbert <- replace_na(Gilbert, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Gilbert, path = "Gilbert.csv") 

#Glendale
a <- read_csv("GlendaleSkills1.csv")
b <- read_csv("GlendaleSkills2.csv")
c <- read_csv("GlendaleSkills3.csv")
d <- read_csv("GlendaleSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Glendale <- full_join(j1, j2, by="Term")

Glendale$City <- "Glendale"
Glendale$State <- "AZ"

Glendale <- replace_na(Glendale, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Glendale, path = "Glendale.csv")

#Hialeah
a <- read_csv("HialeahSkills1.csv")
b <- read_csv("HialeahSkills2.csv")
c <- read_csv("HialeahSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Hialeah <- full_join(j1, c, by="Term")

Hialeah$City <- "Hialeah"
Hialeah$State <- "FL"

Hialeah <- replace_na(Hialeah, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Hialeah, path = "Hialeah.csv") 

#Houston
a <- read_csv("HoustonSkills1.csv")
b <- read_csv("HoustonSkills2.csv")
c <- read_csv("HoustonSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Houston <- full_join(j1, c, by="Term")

Houston$City <- "Houston"
Houston$State <- "TX"

Houston <- replace_na(Houston, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Houston, path = "Houston.csv") 

#Irvine
a <- read_csv("IrvineSkills1.csv")
b <- read_csv("IrvineSkills2.csv")
c <- read_csv("IrvineSkills3.csv")
d <- read_csv("IrvineSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Irvine <- full_join(j1, j2, by="Term")

Irvine$City <- "Irvine"
Irvine$State <- "CA"

Irvine <- replace_na(Irvine, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Irvine, path = "Irvine.csv") 

#Irving
a <- read_csv("IrvingSkills1.csv")
b <- read_csv("IrvingSkills2.csv")
c <- read_csv("IrvingSkills3.csv")
d <- read_csv("IrvingSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Irving <- full_join(j1, j2, by="Term")

Irving$City <- "Irving"
Irving$State <- "TX"

Irving <- replace_na(Irving, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Irving, path = "Irving.csv") 

#Jersey
a <- read_csv("JerseySkills1.csv")
b <- read_csv("JerseySkills2.csv")
c <- read_csv("JerseySkills3.csv")
d <- read_csv("JerseySkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Jersey <- full_join(j1, j2, by="Term")

Jersey$City <- "Jersey City"
Jersey$State <- "NJ"

Jersey <- replace_na(Jersey, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Jersey, path = "Jersey.csv") 

#LongBeach
a <- read_csv("LongBeachSkills1.csv")
b <- read_csv("LongBeachSkills2.csv")
c <- read_csv("LongBeachSkills3.csv")
d <- read_csv("LongBeachSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
LongBeach <- full_join(j1, j2, by="Term")

LongBeach$City <- "Long Beach"
LongBeach$State <- "CA"

LongBeach <- replace_na(LongBeach, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(LongBeach, path = "LongBeach.csv") 

#LosAngeles
a <- read_csv("LosAngelesSkills1.csv")
b <- read_csv("LosAngelesSkills2.csv")
c <- read_csv("LosAngelesSkills3.csv")
d <- read_csv("LosAngelesSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
LosAngeles <- full_join(j1, j2, by="Term")

LosAngeles$City <- "Los Angeles"
LosAngeles$State <- "CA"

LosAngeles <- replace_na(LosAngeles, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(LosAngeles, path = "LosAngeles.csv") 

#Mesa
a <- read_csv("MesaSkills1.csv")
b <- read_csv("MesaSkills2.csv")
c <- read_csv("MesaSkills3.csv")
d <- read_csv("MesaSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Mesa <- full_join(j1, j2, by="Term")

Mesa$City <- "Mesa"
Mesa$State <- "AZ"

Mesa <- replace_na(Mesa, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Mesa, path = "Mesa.csv") 

#Miami
a <- read_csv("MiamiSkills1.csv")
b <- read_csv("MiamiSkills2.csv")
c <- read_csv("MiamiSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Miami <- full_join(j1, c, by="Term")

Miami$City <- "Miami"
Miami$State <- "FL"

Miami <- replace_na(Miami, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Miami, path = "Miami.csv") 

#Minneapolis
a <- read_csv("MinneapolisSkills1.csv")
b <- read_csv("MinneapolisSkills2.csv")
c <- read_csv("MinneapolisSkills3.csv")
d <- read_csv("MinneapolisSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Minneapolis <- full_join(j1, j2, by="Term")

Minneapolis$City <- "Minneapolis"
Minneapolis$State <- "MN"

Minneapolis <- replace_na(Minneapolis, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Minneapolis, path = "Minneapolis.csv") 

#Nashville
a <- read_csv("NashvilleSkills1.csv")
b <- read_csv("NashvilleSkills2.csv")
c <- read_csv("NashvilleSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Nashville <- full_join(j1, c, by="Term")

Nashville$City <- "Nashville"
Nashville$State <- "TN"

Nashville <- replace_na(Nashville, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Nashville, path = "Nashville.csv") 

#Newark
a <- read_csv("NewarkSkills1.csv")
b <- read_csv("NewarkSkills2.csv")
c <- read_csv("NewarkSkills3.csv")
d <- read_csv("NewarkSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Newark <- full_join(j1, j2, by="Term")

Newark$City <- "Newark"
Newark$State <- "NJ"

Newark <- replace_na(Newark, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Newark, path = "Newark.csv") 

#Oakland
a <- read_csv("OaklandSkills1.csv")
b <- read_csv("OaklandSkills2.csv")
c <- read_csv("OaklandSkills3.csv")
d <- read_csv("OaklandSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Oakland <- full_join(j1, j2, by="Term")

Oakland$City <- "Oakland"
Oakland$State <- "CA"

Oakland <- replace_na(Oakland, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Oakland, path = "Oakland.csv") 

#Philadelphia
a <- read_csv("PhiladelphiaSkills1.csv")
b <- read_csv("PhiladelphiaSkills2.csv")
c <- read_csv("PhiladelphiaSkills3.csv")
d <- read_csv("PhiladelphiaSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Philadelphia <- full_join(j1, j2, by="Term")

Philadelphia$City <- "Philadelphia"
Philadelphia$State <- "PA"

Philadelphia <- replace_na(Philadelphia, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Philadelphia, path = "Philadelphia.csv") 

#Pittsburgh
a <- read_csv("PittsburghSkills1.csv")
b <- read_csv("PittsburghSkills2.csv")
c <- read_csv("PittsburghSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Pittsburgh <- full_join(j1, c, by="Term")

Pittsburgh$City <- "Pittsburgh"
Pittsburgh$State <- "PA"

Pittsburgh <- replace_na(Pittsburgh, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Pittsburgh, path = "Pittsburgh.csv") 

#Phoenix
a <- read_csv("PhoenixSkills1.csv")
b <- read_csv("PhoenixSkills2.csv")
c <- read_csv("PhoenixSkills3.csv")
d <- read_csv("PhoenixSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Phoenix <- full_join(j1, j2, by="Term")

Phoenix$City <- "Phoenix"
Phoenix$State <- "AZ"

Phoenix <- replace_na(Phoenix, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Phoenix, path = "Phoenix.csv") 

#Plano
a <- read_csv("PlanoSkills1.csv")
b <- read_csv("PlanoSkills2.csv")
c <- read_csv("PlanoSkills3.csv")
d <- read_csv("PlanoSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Plano <- full_join(j1, j2, by="Term")

Plano$City <- "Plano"
Plano$State <- "TX"

Plano <- replace_na(Plano, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Plano, path = "Plano.csv") 

#Portland
a <- read_csv("PortlandSkills1.csv")
b <- read_csv("PortlandSkills2.csv")
c <- read_csv("PortlandSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Portland <- full_join(j1, c, by="Term")

Portland$City <- "Portland"
Portland$State <- "OR"

Portland <- replace_na(Portland, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Portland, path = "Portland.csv") 

#Raleigh
a <- read_csv("RaleighSkills1.csv")
b <- read_csv("RaleighSkills2.csv")
c <- read_csv("RaleighSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Raleigh <- full_join(j1, c, by="Term")

Raleigh$City <- "Raleigh"
Raleigh$State <- "NC"

Raleigh <- replace_na(Raleigh, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Raleigh, path = "Raleigh.csv") 

#SanDiego
a <- read_csv("SanDiegoSkills1.csv")
b <- read_csv("SanDiegoSkills2.csv")
c <- read_csv("SanDiegoSkills3.csv")
d <- read_csv("SanDiegoSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
SanDiego <- full_join(j1, j2, by="Term")

SanDiego$City <- "San Diego"
SanDiego$State <- "CA"

SanDiego <- replace_na(SanDiego, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(SanDiego, path = "SanDiego.csv") 

#SanFrancisco
a <- read_csv("SanFranciscoSkills1.csv")
b <- read_csv("SanFranciscoSkills2.csv")
c <- read_csv("SanFranciscoSkills3.csv")
d <- read_csv("SanFranciscoSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
SanFrancisco <- full_join(j1, j2, by="Term")

SanFrancisco$City <- "San Francisco"
SanFrancisco$State <- "CA"

SanFrancisco <- replace_na(SanFrancisco, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(SanFrancisco, path = "SanFrancisco.csv") 

#SanJose
a <- read_csv("SanJoseSkills1.csv")
b <- read_csv("SanJoseSkills2.csv")
c <- read_csv("SanJoseSkills3.csv")
d <- read_csv("SanJoseSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
SanJose <- full_join(j1, j2, by="Term")

SanJose$City <- "San Jose"
SanJose$State <- "CA"

SanJose <- replace_na(SanJose, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(SanJose, path = "SanJose.csv") 

#SantaAna
a <- read_csv("SantaAnaSkills1.csv")
b <- read_csv("SantaAnaSkills2.csv")
c <- read_csv("SantaAnaSkills3.csv")
d <- read_csv("SantaAnaSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
SantaAna <- full_join(j1, j2, by="Term")

SantaAna$City <- "Santa Ana"
SantaAna$State <- "CA"

SantaAna <- replace_na(SantaAna, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(SantaAna, path = "SantaAna.csv") 

#Scottsdale
a <- read_csv("ScottsdaleSkills1.csv")
b <- read_csv("ScottsdaleSkills2.csv")
c <- read_csv("ScottsdaleSkills3.csv")
d <- read_csv("ScottsdaleSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Scottsdale <- full_join(j1, j2, by="Term")

Scottsdale$City <- "Scottsdale"
Scottsdale$State <- "AZ"

Scottsdale <- replace_na(Scottsdale, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Scottsdale, path = "Scottsdale.csv") 

#Seattle
a <- read_csv("SeattleSkills1.csv")
b <- read_csv("SeattleSkills2.csv")
c <- read_csv("SeattleSkills3.csv")
d <- read_csv("SeattleSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Seattle <- full_join(j1, j2, by="Term")

Seattle$City <- "Seattle"
Seattle$State <- "WA"

Seattle <- replace_na(Seattle, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Seattle, path = "Seattle.csv") 

#St.Louis
a <- read_csv("St.LouisSkills1.csv")
b <- read_csv("St.LouisSkills2.csv")
c <- read_csv("St.LouisSkills3.csv")
d <- read_csv("St.LouisSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
St.Louis <- full_join(j1, j2, by="Term")

St.Louis$City <- "St. Louis"
St.Louis$State <- "MO"

St.Louis <- replace_na(St.Louis, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(St.Louis, path = "St.Louis.csv") 

#St.Paul
a <- read_csv("St.PaulSkills1.csv")
b <- read_csv("St.PaulSkills2.csv")
c <- read_csv("St.PaulSkills3.csv")
d <- read_csv("St.PaulSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
St.Paul <- full_join(j1, j2, by="Term")

St.Paul$City <- "St. Paul"
St.Paul$State <- "MN"

St.Paul <- replace_na(St.Paul, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(St.Paul, path = "St.Paul.csv") 

#St.Petersburg
a <- read_csv("St.PetersburgSkills1.csv")
b <- read_csv("St.PetersburgSkills2.csv")
c <- read_csv("St.PetersburgSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
St.Petersburg <- full_join(j1, c, by="Term")

St.Petersburg$City <- "St. Petersburg"
St.Petersburg$State <- "FL"

St.Petersburg <- replace_na(St.Petersburg, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(St.Petersburg, path = "St.Petersburg.csv") 

#Tampa
a <- read_csv("TampaSkills1.csv")
b <- read_csv("TampaSkills2.csv")
c <- read_csv("TampaSkills3.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")

j1 <- full_join(a, b, by="Term") 
Tampa <- full_join(j1, c, by="Term")

Tampa$City <- "Tampa"
Tampa$State <- "FL"

Tampa <- replace_na(Tampa, list(DAnalyst=0, DScientist=0, DArchitect=0))

write_csv(Tampa, path = "Tampa.csv") 

#Washington
a <- read_csv("WashingtonSkills1.csv")
b <- read_csv("WashingtonSkills2.csv")
c <- read_csv("WashingtonSkills3.csv")
d <- read_csv("WashingtonSkills4.csv")

names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")

j1 <- full_join(a, b, by="Term") 
j2 <- full_join(c, d, by="Term")
Washington <- full_join(j1, j2, by="Term")

Washington$City <- "Washington"
Washington$State <- "DC"

Washington <- replace_na(Washington, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))

write_csv(Washington, path = "Washington.csv") 
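As a side note, every city block above repeats the same read/rename/join/annotate/write steps, so they could be collapsed into one helper. Here is a minimal sketch, assuming files named `<City>Skills1.csv` through `<City>SkillsN.csv` in the order Data Analyst, Data Scientist, Data Architect, Data Engineer; the function name and its arguments are my own, not part of the original script:

```r
library(readr)
library(dplyr)
library(tidyr)

# Hypothetical helper collapsing the repeated per-city wrangling steps.
combine_city_skills <- function(file_prefix, city_label, state, n_titles = 4) {
  titles <- c("DAnalyst", "DScientist", "DArchitect", "DEngineer")[1:n_titles]
  # Read each skills file and rename its columns to Term + job title.
  dfs <- lapply(seq_len(n_titles), function(i) {
    df <- read_csv(paste0(file_prefix, "Skills", i, ".csv"))
    names(df) <- c("Term", titles[i])
    df
  })
  # Full-join all data frames on Term (same result as the pairwise joins above).
  out <- Reduce(function(x, y) full_join(x, y, by = "Term"), dfs)
  out$City <- city_label
  out$State <- state
  # Replace missing percentages with zero, as in the original blocks.
  replace_na(out, as.list(setNames(rep(0, n_titles), titles)))
}

# Usage, equivalent to the Seattle block above:
# Seattle <- combine_city_skills("Seattle", "Seattle", "WA")
# write_csv(Seattle, path = "Seattle.csv")
```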

Combining all cities:

master_39cities <- union(Anaheim, Arlington) %>% union(Atlanta) %>% union(Aurora) %>%
  union(Austin) %>% union(Baltimore) %>% union(Boston) %>% union(Boulder) %>%
  union(Chicago) %>% union(Dallas) %>% union(Denver) %>% union(Fremont) %>%
  union(Garland) %>% union(Irving) %>% union(Jersey) %>% union(LongBeach) %>%
  union(LosAngeles) %>% union(Minneapolis) %>% union(Newark) %>% union(NewYork) %>%
  union(Oakland) %>% union(Philadelphia) %>% union(Plano) %>% union(SanFrancisco) %>%
  union(SanJose) %>% union(SantaAna) %>% union(Seattle) %>% union(St.Louis) %>%
  union(St.Paul) %>% union(Washington) %>% union(Chandler) %>% union(Columbus) %>%
  union(Gilbert) %>% union(Glendale) %>% union(Irvine) %>% union(Mesa) %>%
  union(Phoenix) %>% union(SanDiego) %>% union(Scottsdale)

write_csv(master_39cities, path="master_39cities.csv")
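The long `union()` chain can also be written as a fold over a list of the city data frames. This is just a sketch of an equivalent form using base R's `Reduce`, assuming all 39 data frames created above are still in the session:

```r
# Fold union() over a list of city data frames instead of chaining it.
city_dfs <- list(Anaheim, Arlington, Atlanta, Aurora, Austin, Baltimore, Boston,
                 Boulder, Chicago, Dallas, Denver, Fremont, Garland, Irving,
                 Jersey, LongBeach, LosAngeles, Minneapolis, Newark, NewYork,
                 Oakland, Philadelphia, Plano, SanFrancisco, SanJose, SantaAna,
                 Seattle, St.Louis, St.Paul, Washington, Chandler, Columbus,
                 Gilbert, Glendale, Irvine, Mesa, Phoenix, SanDiego, Scottsdale)
master_39cities <- Reduce(union, city_dfs)
```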

master_53cities <- master_39cities %>% select(-DEngineer) %>%
  union(Charlotte) %>% union(Cincinnati) %>% union(Detroit) %>% union(Durham) %>%
  union(FortWorth) %>% union(Hialeah) %>% union(Houston) %>% union(Miami) %>%
  union(Nashville) %>% union(Pittsburgh) %>% union(Portland) %>% union(Raleigh) %>%
  union(St.Petersburg) %>% union(Tampa)

write_csv(master_53cities, path="master_53cities.csv")

After the lengthy process of wrangling and combining the data, I now have three .csv files:

Specific Research questions

After the data collection process, I narrowed down my research questions to the following.

Exploratory Analysis

library(ggplot2)
library(ggrepel)
library(gridExtra)
library(stringr)
library(wordcloud)
library(RColorBrewer)
library(tm)
library(SnowballC)  

Ranking of American cities based on job ads for Data Analysts, Data Scientists, Data Architects & Data Engineers

Wrangling of num_posting data:

num_posting <- read_csv("num_posting_US.csv") #dataset that contains the number of job postings for each job title in each of the 53 cities

num_posting <- num_posting %>% separate(col=`City, State`, into=c("City", "State"), sep=",") 

names(num_posting) <- c("City", "State", "DAnalyst", "DScientist", "DArchitect", "DEngineer")
num_posting <- replace_na(num_posting, list(DEngineer=0)) #The NAs for DEngineer come from cities that didn't have enough job postings for data engineers (fewer than 10); for ease of analysis, I treat them as zero. 

num_posting_long <- num_posting %>% gather(job_title, num_posting, DAnalyst:DEngineer) #tidy data

Plot of number of job postings for each job title in 53 cities:

num_posting %>% 
  ggplot() +
  geom_line(aes(City, DAnalyst, group=1, color="Data Analyst")) +
  geom_line(aes(City, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(City, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(City, DEngineer, group=1, color="Data Engineer")) +
  ylab("Number of job postings") +
  ggtitle("Number of job ads in 53 U.S. cities") +
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

You can see that in almost all 53 cities, the highest number of job ads was for Data Analysts, followed by Data Scientists; the exceptions were San Francisco, San Jose and Seattle.

There were similar numbers of job ads for data engineers and data architects, except in New York, Newark, Jersey City, San Jose, San Francisco, Seattle, Oakland and Fremont, where there were significantly more job ads for data engineers than for data architects.

Plot of total number of job postings (for Data Analysts, Data Scientists, Data Architects & Data Engineers) in 53 cities:

num_posting_long %>% 
  group_by(City) %>%
  summarize(total_num_posting = sum(num_posting)) %>% 
  ggplot(aes(City, total_num_posting))+
  geom_point(color="purple") +
  ylab("Number of job postings") +
  ggtitle("Total number of job postings for Data Analysts, Data Scientists, Data Architects & Data Engineers") +
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        legend.text = element_text(size = 20))

In San Francisco, Newark, New York, Jersey City, Fremont, Oakland and Washington, there were over 1,000 job ads in total for Data Analysts, Data Scientists, Data Architects & Data Engineers combined. San Jose had over 800, while Boston & Seattle had over 500.

Comparing demand of data science tools between Data analysts, Data scientists, Data architects & Data engineers in Canada

Creating a master dataset for Canada:

canada1 <- read_csv("canada1.csv")
canada2 <- read_csv("canada2.csv")
canada3 <- read_csv("canada3.csv")
canada4 <- read_csv("canada4.csv")

names(canada1) <- c("Term", "DAnalyst")
names(canada2) <- c("Term", "DScientist")
names(canada3) <- c("Term", "DArchitect")
names(canada4) <- c("Term", "DEngineer")

j1 <- full_join(canada1, canada2, by="Term") 
j2 <- full_join(canada3, canada4, by="Term")
canada <- full_join(j1, j2, by="Term")

canada <- replace_na(canada, replace=list(DAnalyst=0, DScientist=0, DArchitect=0, DEngineer=0)) 

canada_long <- canada %>% gather(job_title, percentage, DAnalyst:DEngineer) #tidy

Plot of data science tools vs. Job title for Canada

canada %>% 
  ggplot() +
  geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst")) +
  geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
  xlab("") + 
  ylab("Percentage appearing in job ads") +
  ggtitle("Data Science Tools in demand in Canada") +
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

For data analysts in the U.S. (39-city average, shown for comparison), 38.3% of job ads asked for SQL, 34.2% for Excel, 11.4% for SAS, 11.2% for R, 10.9% for Tableau and only 7.1% for Python.

For data analysts in Canada, 43.2% of job ads asked for SQL, 41.2% for Excel, 16.1% for R, 16.1% for Tableau, 12.3% for Python and 10.5% for SAS.

canada %>% select(Term, DAnalyst) %>% arrange(desc(DAnalyst))
## Source: local data frame [27 x 2]
## 
##          Term  DAnalyst
##         (chr)     (dbl)
## 1         SQL 43.222506
## 2       Excel 41.176471
## 3           R 16.112532
## 4     Tableau 16.112532
## 5      Python 12.276215
## 6         SAS 10.485934
## 7  JavaScript 10.230179
## 8        Perl  8.695652
## 9        Java  8.439898
## 10     Hadoop  6.649616
## ..        ...       ...

For data scientists in the U.S. (39-city average, shown for comparison), 47.8% of job ads asked for R, 44.1% for Python, 36.3% for Hadoop, 31.4% for SQL, 28.2% for Java, 26.6% for SAS and only 8.7% for Excel.

For data scientists in Canada, 34.4% of job ads asked for R, 34.4% for Python, 30.5% for SQL, 23.4% for Java, 19.5% for Tableau, 18.8% for Hadoop and 11.7% for Excel.

canada %>% select(Term, DScientist) %>% arrange(desc(DScientist))
## Source: local data frame [27 x 2]
## 
##          Term DScientist
##         (chr)      (dbl)
## 1           R   34.37500
## 2      Python   34.37500
## 3         SQL   30.46875
## 4        Java   23.43750
## 5     Tableau   19.53125
## 6      Hadoop   18.75000
## 7         SAS   17.96875
## 8       Spark   17.18750
## 9  JavaScript   13.28125
## 10      Excel   11.71875
## ..        ...        ...

For data architects in the U.S. (39-city average, shown for comparison), 45.3% asked for SQL, 32.9% for Hadoop, 31.7% for Java, 21.8% for NoSQL, 11.5% for R and 10.6% for Python.

For data architects in Canada, 32.7% asked for SQL, 13.9% for Java, 13.9% for Hadoop, 11.9% for Tableau, 11.9% for Python and only 5.9% for R.

canada %>% select(Term, DArchitect) %>% arrange(desc(DArchitect))
## Source: local data frame [27 x 2]
## 
##          Term DArchitect
##         (chr)      (dbl)
## 1         SQL  32.673267
## 2        Java  13.861386
## 3      Hadoop  13.861386
## 4     Tableau  11.881188
## 5      Python  11.881188
## 6  JavaScript   9.900990
## 7         C++   7.920792
## 8        Hive   7.920792
## 9         SAS   6.930693
## 10          R   5.940594
## ..        ...        ...

For data engineers in the U.S. (39-city average, shown for comparison), 58.6% asked for Hadoop, 55.8% for Java, 44.6% for Spark, 31.2% for SQL, 27.6% for Python and only 7.2% for R.

For data engineers in Canada, 66.7% asked for Python, 63.0% for Hadoop, 40.7% for Spark, 37.0% for SQL & Scala, and 18.5% for Java.

canada %>% select(Term, DEngineer) %>% arrange(desc(DEngineer))
## Source: local data frame [27 x 2]
## 
##      Term DEngineer
##     (chr)     (dbl)
## 1  Python  66.66667
## 2  Hadoop  62.96296
## 3   Spark  40.74074
## 4     SQL  37.03704
## 5   Scala  37.03704
## 6    Hive  33.33333
## 7    Java  18.51852
## 8   HBase  18.51852
## 9   NoSQL  14.81481
## 10    Pig  14.81481
## ..    ...       ...

It’s really interesting to see that the top two most-mentioned tools for data analysts are SQL & Excel, whereas those for data scientists are R & Python. In Canada, the percentage of job ads asking for SQL was fairly consistent across all four job titles (roughly 30-43%). Hadoop, in contrast, ranged from only 6.6% in ads for data analysts up to 63.0% in ads for data engineers.

Comparing demand of data science tools between Data analysts, Data scientists, Data architects & Data engineers in USA

Creating a master dataset for USA nationwide count:

usa1 <- read_csv("national1.csv")
usa2 <- read_csv("national2.csv")
usa3 <- read_csv("national3.csv")
usa4 <- read_csv("national4.csv")

names(usa1) <- c("Term", "DAnalyst")
names(usa2) <- c("Term", "DScientist")
names(usa3) <- c("Term", "DArchitect")
names(usa4) <- c("Term", "DEngineer")

j1 <- full_join(usa1, usa2, by="Term") 
j2 <- full_join(usa3, usa4, by="Term")
usa <- full_join(j1, j2, by="Term")

USA <- replace_na(usa, replace=list(DAnalyst=0, DScientist=0, DArchitect=0, DEngineer=0)) 

USA_long <- USA %>% gather(job_title, percentage, DAnalyst:DEngineer) #tidy, using the NA-replaced data

Plot of data science tools vs. Job title for USA

USA %>% 
  ggplot() +
  geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst")) +
  geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
  xlab("") + 
  ylab("Percentage appearing in job ads") +
  ggtitle("Data Science Tools in demand in USA") +
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

For data analysts in the U.S. (39-city average, repeated for comparison), 38.3% of job ads asked for SQL, 34.2% for Excel, 11.4% for SAS, 11.2% for R, 10.9% for Tableau and only 7.1% for Python.

For data analysts in Canada, 43.2% of job ads asked for SQL, 41.2% for Excel, 16.1% for R, 16.1% for Tableau, 12.3% for Python and 10.5% for SAS.

For data analysts in the U.S. nationwide search, 56.2% of job ads asked for Excel, 27.5% for SQL, 12.5% for Tableau, 12.4% for R, 10.3% for Java and 5.5% for SAS.

USA %>% select(Term, DAnalyst) %>% arrange(desc(DAnalyst))
## Source: local data frame [30 x 2]
## 
##          Term  DAnalyst
##         (chr)     (dbl)
## 1       Excel 56.221854
## 2         SQL 27.470965
## 3     Tableau 12.491112
## 4           R 12.396302
## 5        Java 10.263096
## 6         SAS  5.451529
## 7      Hadoop  3.602749
## 8      Python  2.939085
## 9        SPSS  2.844276
## 10 JavaScript  2.630955
## ..        ...       ...

For data scientists in the U.S. (39-city average, repeated for comparison), 47.8% of job ads asked for R, 44.1% for Python, 36.3% for Hadoop, 31.4% for SQL, 28.2% for Java, 26.6% for SAS and only 8.7% for Excel.

For data scientists in Canada, 34.4% of job ads asked for R, 34.4% for Python, 30.5% for SQL, 23.4% for Java, 19.5% for Tableau, 18.8% for Hadoop and 11.7% for Excel.

For data scientists in the U.S. nationwide search, 37.7% of job ads asked for Python, 36.6% for R, 29.4% for SQL, 27.0% for Hadoop, 24.9% for Java, 18.7% for Matlab, 12.8% for SAS and only 4.6% for Excel.

USA %>% select(Term, DScientist) %>% arrange(desc(DScientist)) 
## Source: local data frame [30 x 2]
## 
##      Term DScientist
##     (chr)      (dbl)
## 1  Python   37.65892
## 2       R   36.56291
## 3     SQL   29.41692
## 4  Hadoop   27.04954
## 5    Java   24.85752
## 6  Matlab   18.67602
## 7   Spark   13.15213
## 8     SAS   12.80140
## 9    Hive   12.71372
## 10    C++   11.74923
## ..    ...        ...

For data architects in the U.S. (39-city average, repeated for comparison), 45.3% asked for SQL, 32.9% for Hadoop, 31.7% for Java, 21.8% for NoSQL, 11.5% for R and 10.6% for Python.

For data architects in Canada, 32.7% asked for SQL, 13.9% for Java, 13.9% for Hadoop, 11.9% for Tableau, 11.9% for Python and only 5.9% for R.

For data architects in the U.S. nationwide search, 45.2% asked for SQL, 18.0% for Hadoop, 17.2% for Java, 16.1% for Tableau, 9.1% for Python and 7.9% for R.

USA %>% select(Term, DArchitect) %>% arrange(desc(DArchitect))
## Source: local data frame [30 x 2]
## 
##          Term DArchitect
##         (chr)      (dbl)
## 1         SQL  45.209581
## 2      Hadoop  17.964072
## 3        Java  17.215569
## 4     Tableau  16.092814
## 5       Spark  10.179641
## 6       NoSQL  10.029940
## 7      Python   9.131737
## 8        Hive   8.532934
## 9           R   7.859281
## 10 JavaScript   7.784431
## ..        ...        ...

For data engineers in the U.S. (39-city average, repeated for comparison), 58.6% asked for Hadoop, 55.8% for Java, 44.6% for Spark, 31.2% for SQL, 27.6% for Python and only 7.2% for R.

For data engineers in Canada, 66.7% asked for Python, 63.0% for Hadoop, 40.7% for Spark, 37.0% for SQL & Scala, and 18.5% for Java.

For data engineers in the U.S. nationwide search, 47.5% asked for SQL, 36.6% for Hadoop, 30.6% for Python, 28.7% for Java and 9.1% for R.

USA %>% select(Term, DEngineer) %>% arrange(desc(DEngineer)) 
## Source: local data frame [30 x 2]
## 
##      Term DEngineer
##     (chr)     (dbl)
## 1     SQL  47.48031
## 2  Hadoop  36.53543
## 3  Python  30.62992
## 4    Java  28.66142
## 5   Spark  20.78740
## 6    Hive  17.79528
## 7   NoSQL  12.99213
## 8   HBase  11.96850
## 9     Pig  11.10236
## 10  Scala  11.02362
## ..    ...       ...

It’s really interesting to see that, in the nationwide search, the top two most-mentioned tools for data analysts are Excel & SQL, whereas those for data scientists are Python & R. The percentage of job ads asking for SQL was fairly consistent across all four job titles. The same goes for Hadoop, except in job ads for data analysts, where it was asked for only 3.6% of the time.

Loading the datasets with the individual cities’ data:

master_39cities <- read_csv("master_39cities.csv")
master_53cities <- read_csv("master_53cities.csv")
master_39cities_long <- master_39cities %>% gather(job_title, percentage, DAnalyst: DEngineer) #tidy
master_53cities_long <- master_53cities %>% gather(job_title, percentage, DAnalyst: DArchitect) #tidy

Below, I’m calculating the average percentage of job ads in which each term (i.e., each data science tool) appears across the cities gathered.

Average appearance of the data science tools in job ads for Data Analysts, Data Scientists, Data Architects and Data Engineers in 39 U.S. cities

avg_appearance_39 <- master_39cities %>% 
  group_by(Term) %>% 
  summarize(avg_DAnalyst = mean(DAnalyst), avg_DScientist = mean(DScientist), 
            avg_DArchitect = mean(DArchitect), avg_DEngineer = mean(DEngineer)) 
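Equivalently, the same averages can be computed from the tidy (long) form without naming each column; a sketch using the `master_39cities_long` object created earlier (the name `avg_appearance_39_long` is my own):

```r
# One mean per Term x job_title combination from the long data.
avg_appearance_39_long <- master_39cities_long %>%
  group_by(Term, job_title) %>%
  summarize(avg_pct = mean(percentage))
```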

avg_appearance_39 %>%
  ggplot() +
  geom_line(aes(Term, avg_DAnalyst, group=1, color="Data Analyst")) + 
  geom_line(aes(Term, avg_DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, avg_DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(Term, avg_DEngineer, group=1, color="Data Engineer")) +
  xlab("") +
  ylab("Average percentage of appearance in job ads") +
  ggtitle("Data Science tools most common in job ads (39 U.S. cities)")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

For data analysts, 38.3% of job ads asked for SQL, 34.2% for Excel, 11.4% for SAS, 11.2% for R, 10.9% for Tableau and only 7.1% for Python.

avg_appearance_39 %>% select(Term, avg_DAnalyst) %>% arrange(desc(avg_DAnalyst))
## Source: local data frame [30 x 2]
## 
##          Term avg_DAnalyst
##         (chr)        (dbl)
## 1         SQL    38.329201
## 2       Excel    34.290584
## 3         SAS    11.357391
## 4           R    11.196742
## 5     Tableau    10.968081
## 6      Python     7.139954
## 7  JavaScript     6.549579
## 8        Java     6.119270
## 9      Hadoop     4.987034
## 10       SPSS     3.666810
## ..        ...          ...

For data scientists, 47.8% of job ads asked for R, 44.1% for Python, 36.3% for Hadoop, 31.4% for SQL, 28.2% for Java, 26.6% for SAS and only 8.7% for Excel.

avg_appearance_39 %>% select(Term, avg_DScientist) %>% arrange(desc(avg_DScientist))
## Source: local data frame [30 x 2]
## 
##       Term avg_DScientist
##      (chr)          (dbl)
## 1        R       47.77967
## 2   Python       44.11741
## 3   Hadoop       36.30741
## 4      SQL       31.42600
## 5     Java       28.16051
## 6      SAS       26.59200
## 7    Spark       17.80438
## 8     Hive       16.76739
## 9   Matlab       16.08920
## 10 Tableau       15.07856
## ..     ...            ...

For data architects, 45.3% asked for SQL, 32.9% for Hadoop, 31.7% for Java, 21.8% for NoSQL, 11.5% for R and 10.6% for Python.

avg_appearance_39 %>% select(Term, avg_DArchitect) %>% arrange(desc(avg_DArchitect))
## Source: local data frame [30 x 2]
## 
##         Term avg_DArchitect
##        (chr)          (dbl)
## 1        SQL       45.26388
## 2     Hadoop       32.86227
## 3       Java       31.69227
## 4      NoSQL       21.76291
## 5       Hive       14.17215
## 6        Pig       11.92220
## 7          R       11.55942
## 8  Cassandra       11.02397
## 9      Spark       10.65855
## 10    Python       10.56799
## ..       ...            ...

For data engineers, 58.6% asked for Hadoop, 55.8% for Java, 44.6% for Spark, 31.2% for SQL, 27.6% for Python and only 7.2% for R.

avg_appearance_39 %>% select(Term, avg_DEngineer) %>% arrange(desc(avg_DEngineer))
## Source: local data frame [30 x 2]
## 
##         Term avg_DEngineer
##        (chr)         (dbl)
## 1     Hadoop      58.60279
## 2       Java      55.79221
## 3      Spark      44.61854
## 4       Hive      38.81320
## 5      HBase      37.50008
## 6  Cassandra      31.62938
## 7        SQL      31.17291
## 8        Pig      30.94266
## 9  MapReduce      30.91452
## 10    Python      27.63428
## ..       ...           ...

It’s really interesting to see that the top two most-mentioned tools for data analysts are SQL & Excel, whereas those for data scientists are R & Python. The percentage of job ads asking for SQL was fairly consistent across all four job titles. The same goes for Hadoop, except in job ads for data analysts, where it was asked for only 5.0% of the time.

Average appearance of the terms for Data Analysts, Data Scientists & Data Architects in 53 U.S. cities

#I'm also checking whether the trends differ when looking at all 53 cities. Since this dataset does not have the numbers for data engineers, I'll only be looking at data analysts, data scientists & data architects.  
avg_appearance_53 <- master_53cities %>% 
  group_by(Term) %>% 
  summarize(avg_DAnalyst = mean(DAnalyst), avg_DScientist = mean(DScientist), 
            avg_DArchitect = mean(DArchitect)) 

avg_appearance_53 %>% 
  ggplot() +
  geom_line(aes(Term, avg_DAnalyst, group=1, color="Data Analyst")) + 
  geom_line(aes(Term, avg_DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, avg_DArchitect, group=1, color="Data Architect")) +
  xlab("") +
  ylab("Average percentage of appearance in job ads") +
  ggtitle("Data Science tools most common in job ads (53 U.S. cities)")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

You can observe similar trends for the 53 cities as well.

avg_appearance_53 %>% select(Term, avg_DAnalyst) %>% arrange(desc(avg_DAnalyst))
## Source: local data frame [30 x 2]
## 
##          Term avg_DAnalyst
##         (chr)        (dbl)
## 1         SQL    40.192633
## 2       Excel    34.378379
## 3         SAS    11.604102
## 4     Tableau    10.653617
## 5           R     9.524691
## 6      Python     6.890490
## 7  JavaScript     6.689093
## 8        Java     5.842706
## 9      Hadoop     4.882816
## 10       SPSS     3.551104
## ..        ...          ...
avg_appearance_53 %>% select(Term, avg_DScientist) %>% arrange(desc(avg_DScientist))
## Source: local data frame [30 x 2]
## 
##       Term avg_DScientist
##      (chr)          (dbl)
## 1        R       44.95279
## 2   Python       42.06721
## 3   Hadoop       32.43743
## 4      SQL       30.70055
## 5      SAS       27.11487
## 6     Java       24.88270
## 7   Matlab       18.85610
## 8  Tableau       17.41630
## 9     Hive       17.38719
## 10   Spark       15.80257
## ..     ...            ...
avg_appearance_53 %>% select(Term, avg_DArchitect) %>% arrange(desc(avg_DArchitect))
## Source: local data frame [30 x 2]
## 
##         Term avg_DArchitect
##        (chr)          (dbl)
## 1        SQL       47.65299
## 2     Hadoop       31.11683
## 3       Java       28.27701
## 4      NoSQL       22.02256
## 5       Hive       15.46513
## 6        Pig       12.77114
## 7  Cassandra       11.95777
## 8      Spark       11.28517
## 9      HBase       10.33732
## 10         R       10.33601
## ..       ...            ...

Data Science tools in demand for Data Analysts in 53 cities

master_53cities %>% 
  ggplot(aes(Term, DAnalyst)) + 
  geom_boxplot(color="red") +
  xlab("") +
  ylab("Percentage of appearance in job ads") +
  ggtitle("Data Science Tools in demand for Data Analysts")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Data Science Tools in demand for Data Scientists in 53 cities

master_53cities %>% 
  ggplot(aes(Term, DScientist)) + 
  geom_boxplot(color="purple") +
  xlab("") +
  ylab("Percentage of appearance in job ads") +
  ggtitle("Data Science Tools in demand for Data Scientists")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Data Science Tools in demand for Data Architects in 53 cities

master_53cities %>% 
  ggplot(aes(Term, DArchitect)) + 
  geom_boxplot(color="green") +
  xlab("") +
  ylab("Percentage of appearance in job ads") +
  ggtitle("Data Science Tools in demand for Data Architects")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Data Science Tools in demand for Data Engineers in 39 cities (fewer cities had enough job ads for data engineers)

master_39cities %>% 
  ggplot(aes(Term, DEngineer)) + 
  geom_boxplot(color="turquoise") +
  xlab("") +
  ylab("Percentage of appearance in job ads") +
  ggtitle("Data Science Tools in demand for Data Engineers")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Comparing demand of Data Science Tools in different U.S. cities

Newark vs. Jersey City vs. New York vs. Washington

master_39cities%>% 
  filter (City == "New York" | City=="Newark" | City =="Jersey" | City =="Washington") %>%
  ggplot() +
  geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
  geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) + 
  facet_wrap(~City) +
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Oakland vs. Boston vs. San Francisco vs. Fremont

master_39cities%>% 
  filter (City == "Boston" | City=="Oakland" | City =="San Francisco" | City =="Fremont") %>%
  ggplot() +
  geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
  geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) + 
  facet_wrap(~City) +
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Los Angeles vs. Atlanta vs. Long Beach vs. Irving

master_39cities%>% 
  filter (City == "Los Angeles" | City=="Atlanta" | City =="Long Beach" | City =="Irving") %>%
  ggplot() +
  geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
  geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) + 
  facet_wrap(~City) +
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

San Jose vs. Chicago vs. Anaheim vs. Dallas

master_39cities%>% 
  filter (City == "San Jose" | City=="Chicago" | City =="Anaheim" | City =="Dallas") %>%
  ggplot() +
  geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
  geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) + 
  facet_wrap(~City) +
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Garland vs. Plano vs. Arlington vs. Philadelphia

master_39cities%>% 
  filter (City == "Garland" | City=="Plano" | City =="Arlington" | City =="Philadelphia") %>%
  ggplot() +
  geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
  geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
  geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
  geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) + 
  facet_wrap(~City) +
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool")+
  theme( text=element_text(size=20), 
        axis.text.x=element_text(angle=90, vjust=1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Seattle vs. Minneapolis vs. St. Paul vs. Baltimore

master_39cities %>% 
  filter(City %in% c("Seattle", "Minneapolis", "St. Paul", "Baltimore")) %>%
  ggplot() +
  geom_line(aes(Term, DAnalyst, group = 1, color = "Data Analyst")) +
  geom_line(aes(Term, DScientist, group = 1, color = "Data Scientist")) +
  geom_line(aes(Term, DArchitect, group = 1, color = "Data Architect")) +
  geom_line(aes(Term, DEngineer, group = 1, color = "Data Engineer")) + 
  facet_wrap(~City) +
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool") +
  theme(text = element_text(size = 20), 
        axis.text.x = element_text(angle = 90, vjust = 1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Denver vs. Santa Ana vs. St. Louis vs. Aurora

master_39cities %>% 
  filter(City %in% c("Denver", "Santa Ana", "St. Louis", "Aurora")) %>%
  ggplot() +
  geom_line(aes(Term, DAnalyst, group = 1, color = "Data Analyst")) +
  geom_line(aes(Term, DScientist, group = 1, color = "Data Scientist")) +
  geom_line(aes(Term, DArchitect, group = 1, color = "Data Architect")) +
  geom_line(aes(Term, DEngineer, group = 1, color = "Data Engineer")) + 
  facet_wrap(~City) +
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool") +
  theme(text = element_text(size = 20), 
        axis.text.x = element_text(angle = 90, vjust = 1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Mesa vs. Phoenix vs. Scottsdale vs. Irvine

master_39cities %>% 
  filter(City %in% c("Mesa", "Phoenix", "Scottsdale", "Irvine")) %>%
  ggplot() +
  geom_line(aes(Term, DAnalyst, group = 1, color = "Data Analyst")) +
  geom_line(aes(Term, DScientist, group = 1, color = "Data Scientist")) +
  geom_line(aes(Term, DArchitect, group = 1, color = "Data Architect")) +
  geom_line(aes(Term, DEngineer, group = 1, color = "Data Engineer")) + 
  facet_wrap(~City) +
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool") +
  theme(text = element_text(size = 20), 
        axis.text.x = element_text(angle = 90, vjust = 1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Chandler vs. Gilbert vs. Glendale vs. Boulder

master_39cities %>% 
  filter(City %in% c("Chandler", "Gilbert", "Glendale", "Boulder")) %>%
  ggplot() +
  geom_line(aes(Term, DAnalyst, group = 1, color = "Data Analyst")) +
  geom_line(aes(Term, DScientist, group = 1, color = "Data Scientist")) +
  geom_line(aes(Term, DArchitect, group = 1, color = "Data Architect")) +
  geom_line(aes(Term, DEngineer, group = 1, color = "Data Engineer")) + 
  facet_wrap(~City) +
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool") +
  theme(text = element_text(size = 20), 
        axis.text.x = element_text(angle = 90, vjust = 1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Austin vs. Columbus vs. San Diego

master_39cities %>% 
  filter(City %in% c("Austin", "Columbus", "San Diego")) %>%
  ggplot() +
  geom_line(aes(Term, DAnalyst, group = 1, color = "Data Analyst")) +
  geom_line(aes(Term, DScientist, group = 1, color = "Data Scientist")) +
  geom_line(aes(Term, DArchitect, group = 1, color = "Data Architect")) +
  geom_line(aes(Term, DEngineer, group = 1, color = "Data Engineer")) + 
  facet_wrap(~City) +
  xlab("") +
  ylab("Percentage of appearance in job postings") +
  ggtitle("Percentage of job ads with a data science tool") +
  theme(text = element_text(size = 20), 
        axis.text.x = element_text(angle = 90, vjust = 1), 
        plot.title = element_text(size = rel(1.1)), 
        legend.text = element_text(size = 20))

Stack Overflow data

load("data_scientist.rda")

stackoverflow <- data_scientist %>% select(JobId, Title, Tags, LocationString, Description, CountryCode, StateCode)
Description <- stackoverflow %>% select(Description)

Creating general functions to count the terms (names of the data science tools)

count1 <- function(term, lang) { # Counts one variation of the term
  lang <- str_count(Description$Description, term) %>% data.frame()
  names(lang) <- "count"
  lang <- lang %>% mutate(count = ifelse(count == 0, 0, 1)) 
  # Scenario 1: if the count is 0, the term did not appear in the job ad. 
  # Scenario 2: if the count is non-zero, the term appeared one or more times in the job ad. 
  # Since we only care about the proportion of job ads that contain the term, the count is capped at 1 in scenario 2. 
  lang
}

count2 <- function(term1, term2, lang) { # Counts 2 variations of the term
  a <- str_count(Description$Description, term1) %>% data.frame()
  names(a) <- "count"
  b <- str_count(Description$Description, term2) %>% data.frame()
  names(b) <- "count"
  lang <- a + b
  lang <- lang %>% mutate(count = ifelse(count == 0, 0, 1))
  lang
}

count3 <- function(term1, term2, term3, lang) { # Counts 3 variations of the term
  a <- str_count(Description$Description, term1) %>% data.frame()
  names(a) <- "count"
  b <- str_count(Description$Description, term2) %>% data.frame()
  names(b) <- "count"
  c <- str_count(Description$Description, term3) %>% data.frame()
  names(c) <- "count"
  lang <- a + b + c
  lang <- lang %>% mutate(count = ifelse(count == 0, 0, 1))
  lang
}

count7 <- function(term1, term2, term3, term4, term5, term6, term7, lang) { # Counts 7 variations of the term
  a <- str_count(Description$Description, term1) %>% data.frame()
  names(a) <- "count"
  b <- str_count(Description$Description, term2) %>% data.frame()
  names(b) <- "count"
  c <- str_count(Description$Description, term3) %>% data.frame()
  names(c) <- "count"
  d <- str_count(Description$Description, term4) %>% data.frame()
  names(d) <- "count"
  e <- str_count(Description$Description, term5) %>% data.frame()
  names(e) <- "count"
  f <- str_count(Description$Description, term6) %>% data.frame()
  names(f) <- "count"
  g <- str_count(Description$Description, term7) %>% data.frame()
  names(g) <- "count"
  lang <- a + b + c + d + e + f + g
  lang <- lang %>% mutate(count = ifelse(count == 0, 0, 1))
  lang
}

count3_fixed3 <- function(term1, term2, term3, lang){ # Counts 3 fixed (word-boundary) variations of the term
  a <- grepl(term1, Description$Description) %>% data.frame()
  # When I was trying to count R, `str_count` was matching every R inside other words, 
  # not just the standalone word "R". So I switched to grepl with word-boundary patterns, 
  # which lets me match whole words only. 
  names(a) <- "count"
  a <- a %>% mutate(count = ifelse(count == TRUE, 1, 0))
  b <- grepl(term2, Description$Description) %>% data.frame()
  names(b) <- "count"
  b <- b %>% mutate(count = ifelse(count == TRUE, 1, 0))
  c <- grepl(term3, Description$Description) %>% data.frame()
  names(c) <- "count"
  c <- c %>% mutate(count = ifelse(count == TRUE, 1, 0))
  lang <- a + b + c
  lang <- lang %>% mutate(count = ifelse(count == 0, 0, 1))
  lang
}
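The word-boundary problem described in the comment above is easy to demonstrate with base R alone (the example strings here are made up for illustration and are not from the scraped ads):

```r
# Plain substring matching finds the "R" hidden inside other words:
grepl("R", "Experience with SPARK")        # TRUE: SPARK contains an R

# TRE word boundaries (\< and \>) only match R as a standalone word:
grepl("\\<R\\>", "Experience with SPARK")  # FALSE
grepl("\\<R\\>", "Experience with R")      # TRUE
grepl("\\<R\\>", "R, Python and SQL")      # TRUE: punctuation ends the word
```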

count6_fixed6 <- function(term1, term2, term3, term4, term5, term6, lang){ #Counting 6 fixed variations of the term
  a <- grepl(term1, Description$Description) %>% data.frame() 
  names(a) <- "count"
  a <- a %>% mutate(count=ifelse(count==TRUE, 1, 0))
  b <- grepl(term2, Description$Description) %>% data.frame()
  names(b) <- "count"
  b <- b %>% mutate(count=ifelse(count==TRUE, 1, 0))
  c <- grepl(term3, Description$Description) %>% data.frame()
  names(c) <- "count"
  c <- c %>% mutate(count=ifelse(count==TRUE, 1, 0))
  d <- grepl(term4, Description$Description) %>% data.frame()
  names(d) <- "count"
  d <- d %>% mutate(count=ifelse(count==TRUE, 1, 0))
  e <- grepl(term5, Description$Description) %>% data.frame()
  names(e) <- "count"
  e <- e %>% mutate(count=ifelse(count==TRUE, 1, 0))
  f <- grepl(term6, Description$Description) %>% data.frame()
  names(f) <- "count"
  f <- f %>% mutate(count=ifelse(count==TRUE, 1, 0))
  lang <- a+b+c+d+e+f
  lang <- lang %>% mutate(count=ifelse(count == 0, 0, 1))
  lang
}

count9_fixed5 <- function(term1, term2, term3, term4, # term1-5 are fixed (word-boundary) patterns 
                   term5, term6, term7, term8, term9, lang) {
  a <- grepl(term1, Description$Description) %>% data.frame()
  names(a) <- "count"
  a <- a %>% mutate(count = ifelse(count == TRUE, 1, 0))
  b <- grepl(term2, Description$Description) %>% data.frame()
  names(b) <- "count"
  b <- b %>% mutate(count = ifelse(count == TRUE, 1, 0))
  c <- grepl(term3, Description$Description) %>% data.frame()
  names(c) <- "count"
  c <- c %>% mutate(count = ifelse(count == TRUE, 1, 0))
  d <- grepl(term4, Description$Description) %>% data.frame()
  names(d) <- "count"
  d <- d %>% mutate(count = ifelse(count == TRUE, 1, 0))
  e <- grepl(term5, Description$Description) %>% data.frame()
  names(e) <- "count"
  e <- e %>% mutate(count = ifelse(count == TRUE, 1, 0))
  f <- str_count(Description$Description, term6) %>% data.frame()
  names(f) <- "count"
  g <- str_count(Description$Description, term7) %>% data.frame()
  names(g) <- "count"
  h <- str_count(Description$Description, term8) %>% data.frame()
  names(h) <- "count"
  i <- str_count(Description$Description, term9) %>% data.frame()
  names(i) <- "count"
  lang <- a + b + c + d + e + f + g + h + i
  lang <- lang %>% mutate(count = ifelse(count == 0, 0, 1))
  lang
}
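As an aside, the countN family above could be collapsed into one helper that takes any number of patterns. This is only a sketch, not part of the original script: `count_any` and the three-row toy `Description` frame are mine, and I use base `grepl` throughout rather than mixing `str_count` and `grepl`.

```r
# Toy stand-in for the real Stack Overflow job-ad descriptions (illustration only)
Description <- data.frame(
  Description = c("We use Python and SQL daily.",
                  "Strong R, Hadoop and Spark skills.",
                  "No programming required."),
  stringsAsFactors = FALSE
)

# One helper instead of count1/count2/count3/count7:
# a job ad scores 1 if any of the supplied patterns appears at least once
count_any <- function(...) {
  hits <- rep(FALSE, nrow(Description))
  for (pattern in c(...)) {
    hits <- hits | grepl(pattern, Description$Description)
  }
  data.frame(count = as.integer(hits))
}

Python <- count_any("Python", "python")     # counts: 1, 0, 0
R_lang <- count_any("\\<R\\>", "\\<R,\\>")  # counts: 0, 1, 0
```

The OR across patterns replaces the add-then-cap step (`a + b + c` followed by `ifelse(count == 0, 0, 1)`) in the original helpers.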

Creating stack_count, which contains one column per term indicating whether a particular job ad mentioned it (yes = 1, no = 0)

R <- count3_fixed3("\\<R\\>", "\\<R.\\>", "\\<R,\\>", R) 
Python <- count2("Python", "python", Python) 
Java <- count2("Java", "java", Java)
C_plus_plus <- count2("C\\+\\+", "c\\+\\+", C_plus_plus)
Ruby <- count2("Ruby", "ruby", Ruby)
Perl <- count2("Perl", "perl", Perl)
Matlab <- count2("Matlab", "matlab", Matlab)
Javascript <- count2("Javascript", "javascript", Javascript)
Scala <- count2("Scala", "scala", Scala)
Excel <- count2("Excel", "excel",Excel)
Tableau <- count2("Tableau","tableau", Tableau)
D3.js <- count2("D3\\.js","d3\\.js", D3.js) # escape the dot so it matches a literal "." rather than any character
SAS <- count2("SAS","sas", SAS)
SPSS <- count2("SPSS","spss", SPSS)
D3 <- count2("d3", "D3", D3)
Hadoop <- count2("Hadoop", "hadoop", Hadoop)
MapReduce <- count2("MapReduce","mapreduce", MapReduce)
Spark <- count2("Spark", "spark", Spark)
Pig <- count2("Pig","pig", Pig)
Hive <- count2("Hive", "hive", Hive)
Shark <- count2("Shark", "shark", Shark)
Oozie <- count2("Oozie", "oozie", Oozie)
Zookeeper <- count2("Zookeeper", "zookeeper", Zookeeper)
Flume <- count2("Flume", "flume", Flume)
Mahout <- count2("Mahout", "mahout", Mahout)
SQL <- count6_fixed6("\\<SQL\\>", "\\<sql\\>","\\<SQL.\\>", "\\<SQL,\\>", "\\<sql.\\>", "\\<sql,\\>", SQL)
NoSQL <- count2("NoSQL", "nosql", NoSQL) 
HBase <- count2("HBase", "hbase", HBase)
Cassandra <- count2("Cassandra", "cassandra", Cassandra)
MongoDB <- count2("MongoDB","mongodb", MongoDB)

MachineLearning <- count2("Machine Learning", "machine learning", MachineLearning)
BigData <- count3("Big Data", "big data", "Big data", BigData)
PhD <- count3("PhD", "Ph.D", "Ph.D.", PhD)
Bachelors <- count7("Undergraduate", "undergraduate", "BA", "BS", "BSc", "Bachelor\'s", "Bachelors", Bachelors)

Masters <- count9_fixed5("\\<Ms\\>", "\\<MS\\>", "\\<M.S.\\>", "\\<MA\\>", "\\<MBA\\>", "MS degree","Master\'s", "Master", "Masters", Masters)

stack_count <- cbind(R, Python, Java, C_plus_plus, Ruby, Perl, Matlab, Javascript, Scala, Excel, Tableau, D3.js, SAS, SPSS, D3, Hadoop, MapReduce, Spark, Pig, Hive, Shark, Oozie, Zookeeper, Flume, Mahout, SQL, NoSQL, HBase, Cassandra, MongoDB, MachineLearning, BigData, PhD, Bachelors, Masters)

names(stack_count) <- c("R", "Python", "Java", "C_plus_plus", "Ruby", "Perl", "Matlab", "Javascript", "Scala", "Excel", "Tableau", "D3.js", "SAS", "SPSS", "D3", "Hadoop", "MapReduce", "Spark", "Pig", "Hive", "Shark", "Oozie", "Zookeeper", "Flume", "Mahout", "SQL", "NoSQL", "HBase", "Cassandra", "MongoDB", "MachineLearning", "BigData", "PhD", "Bachelors", "Masters")

stack_count <- cbind(stackoverflow, stack_count)

Creating sum_count with columns term, count (total number of job ads containing the term) and freq (proportion of job ads containing the term)

sum_count <- stack_count %>% 
  summarize(R = sum(R), Python = sum(Python), Java= sum(Java), 
            `C++` = sum(C_plus_plus), Ruby= sum(Ruby), Perl= sum(Perl),
            Matlab= sum(Matlab), Javascript= sum(Javascript), Scala= sum(Scala),
            Excel= sum(Excel), Tableau= sum(Tableau), D3.js=sum(D3.js),
            SAS=sum(SAS), SPSS=sum(SPSS), D3=sum(D3), Hadoop=sum(Hadoop),
            MapReduce=sum(MapReduce),Spark=sum(Spark), Pig=sum(Pig), Hive=sum(Hive),
            Shark=sum(Shark), Oozie=sum(Oozie),  Zookeeper=sum(Zookeeper),
            Flume=sum(Flume), Mahout=sum(Mahout), SQL=sum(SQL), NoSQL=sum(NoSQL),
            HBase=sum(HBase), Cassandra=sum(Cassandra),
            MongoDB=sum(MongoDB),MachineLearning=sum(MachineLearning),
            BigData=sum(BigData), PhD=sum(PhD), Bachelors=sum(Bachelors),
            Masters=sum(Masters)) %>% 
  tbl_df() %>% 
  gather(term, count, R:Masters) #count = number of job ads that include the `term`

sum_count <- sum_count %>% mutate(freq = count/362) 
# freq = proportion of the 362 job ads that contain the `term`
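The count-then-divide step can be checked on a tiny example; here the total comes from `nrow()` rather than a hard-coded 362 (the toy 0/1 indicator frame below is mine, not the real data):

```r
# Toy 0/1 indicator frame: rows are job ads, columns are terms
indicator <- data.frame(R = c(1, 0, 1), Python = c(1, 1, 1), Hadoop = c(0, 0, 1))

counts <- colSums(indicator)        # number of ads containing each term
freqs  <- counts / nrow(indicator)  # proportion of ads containing each term

freqs["Python"]  # 1: every toy ad mentions Python
freqs["Hadoop"]  # 1/3 of the toy ads mention Hadoop
```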

Word cloud of Data Science Tools mentioned in Stack Overflow Job Ads

pal2 <- brewer.pal(8,"Dark2")
wordcloud(sum_count$term, sum_count$freq, scale = c(4, 1.5), random.order = FALSE, colors=pal2, rot.per = 0, fixed.asp = FALSE)

Creating a word cloud of Stack Overflow Job Ad Descriptions

# HTML markup, entities, and filler words to strip from the ad descriptions,
# applied in order (later patterns see the result of earlier removals)
strip_patterns <- c("\\[|\\]", "looking", "position", "one", "rel=\"nofollow\"",
                    "also", "company", "<li>experience", "well", "<li>work",
                    "<li><span>", "<p>", "</p>", "<p><strong>", "<ul>", "</ul>",
                    "<li>", "</li>", "&amp;", "&nbsp;", "</strong>", "<strong>",
                    "<span>", "<br>", "<em>", "</span>", "<a href",
                    "=\"http://www", "&rsquo", "youll", "will", "within",
                    "can", "using", "apply", "part")

Description1 <- Description
for (p in strip_patterns) {
  Description1 <- Description1 %>% mutate(Description = gsub(p, "", Description))
}
Description1 <- Description1 %>% 
  mutate(Description = gsub("analysing", "analyse", Description))
  
r_words <- c("right", "youll", "get", "next", "high", "bull","way", "etc", "didate", "based", "every", "take", "ndash", "real", "core", "key", "day", "able", "van", "time", "used", "highly", "years", "sets", "want", "seeking", "senior", "related", "join", "ability", "new", "teams", "role", "use", "like", "make", "across", "provide", "drive", "help", "work", "working", "including", "questions", "environment", "implement", "developing", "products", "responsible", "need", "closely", "recommendations", "just", "ing", "set", "andor")

review_text <- paste(Description1$Description, collapse=" ")
review_source <- VectorSource(review_text)
corpus <- Corpus(review_source)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, r_words)

dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)

frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=TRUE)
head(frequency)
##       data       team experience   business  scientist   learning 
##       2249        564        382        378        344        326
words <- names(frequency)

wordcloud(words, frequency, scale=c(3.5,.7), min.freq=30, random.order=FALSE, colors=brewer.pal(8, "Dark2"), max.words = 250, rot.per = 0, fixed.asp = FALSE, random.color = TRUE)